To answer the research question “What is the effect over time of music industry influence on musical artists, in terms of musical content?”, the popularity, complexity, and outside influence of an artist’s songs should be measured over time. These measures are subjective, but they are justifiable and grounded in the data. Each metric will be computed for every song of a specific case-study artist for which there is valid data. Ultimately, the goal is to compute each of these metrics for all of a case-study artist’s songs that charted at some point on the Billboard Hot 100 and plot each metric against the song’s release date.
Popularity will be measured using data on a song’s life and behavior while it was on the Billboard Hot 100. Multiple metrics will be proposed and compared. The complexity of a song will be measured by combining musical complexity (where data on it exists) and lyrical complexity. Finally, outside influence will be measured primarily by the number of writers on the song who were not band members.
All the functions will be defined in a separate source file, and then called in this file. All of the data used here should have already been preprocessed in a previous file.
options(kableExtra.auto_format = FALSE)
library(htmltools)
library(tidyverse)
library(tidytext)
library(data.table)
library(plyr)
library(quanteda)
library(kableExtra)
library(knitr)
library(gridExtra)
library(formattable)
source("frostFunctions.R")
billboardDf = read_csv("FrostData/billboardDataClean.csv", col_types = cols())
spotifyDf = read_csv("FrostData/spotifyDataClean.csv", col_types = cols())
riaaDf = read_csv("FrostData/riaaDataClean.csv", col_types = cols())
grammyDf = read_csv("FrostData/grammyDataClean.csv", col_types = cols())
songSecsDf = read_csv("FrostData/songSectionDataClean.csv", col_types = cols())
songAttrsDf = read_csv("FrostData/songAttrsDataClean.csv", col_types = cols())
Here the functionality will be built to join all data on a chosen artist. For the examples to follow, the band “Maroon 5” will be used.
archArtist = artistDataJoiner("Maroon 5")
validAlbums = c("Red Pill Blues + (Deluxe)", "v (Deluxe)", "Overexposed Track by Track", "Hands all over (Deluxe)", "it Won't be Soon Before Long.", "Songs About Jane")
archArtist = filter(archArtist, Album %in% validAlbums)
#archArtist
Popularity will be quantified in multiple ways to account for the imperfections of each individual metric. The metrics can then be compared and contrasted across songs. They are as follows: pop1 = sum(1/current * weeks); pop2 = sum(1/current); pop3 = ln(101.1 - min(peak)); pop4 = mean(ln(101.1 - current)).
Pop1 rewards songs that reach their peak later in their chart run, so it discriminates against tracks that peak right away and dissipate quickly. Pop2 does not have an appropriate scale, since it treats the number 2 spot on the Hot 100 as only half as valuable as the number 1 spot. Pop3 considers only the peak chart position, but scales it more appropriately than the first two. Pop4 uses the natural log scale to weigh differences in chart position more appropriately, and takes the mean of the log-scaled chart positions to account for both longevity and position.
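As a rough illustration of the four formulas above, the sketch below computes them in base R on a small made-up chart history. It is not the body of getPopularityMetric, which is defined in frostFunctions.R and not shown here. The function name popularity_sketch, the toy data, and the reading of current, peak, and weeks as the weekly chart position, the best position so far, and the weeks on chart so far are all assumptions.

```r
# Hypothetical weekly chart data: one row per song per week on the Hot 100.
# Column meanings are assumed, not taken from the cleaned Billboard file.
chart <- data.frame(
  Name    = c("A", "A", "A", "B", "B"),
  current = c(10, 2, 1, 50, 40),  # chart position in that week
  peak    = c(10, 2, 1, 50, 40),  # best position reached so far
  weeks   = c(1, 2, 3, 1, 2),     # weeks on chart so far
  stringsAsFactors = FALSE
)

# Compute pop1..pop4 for each song from its weekly rows.
popularity_sketch <- function(df) {
  per_song <- split(df, df$Name)
  rows <- lapply(per_song, function(s) {
    data.frame(
      Name = s$Name[1],
      pop1 = sum(s$weeks / s$current),      # sum(1/current * weeks)
      pop2 = sum(1 / s$current),            # sum(1/current)
      pop3 = log(101.1 - min(s$peak)),      # ln(101.1 - min(peak))
      pop4 = mean(log(101.1 - s$current)),  # mean(ln(101.1 - current))
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

pop <- popularity_sketch(chart)
print(pop)
```

In this toy data, song A's week at number 1 comes in its third week on the chart, so that single week contributes 3 to pop1 but only 1 to pop2, which is the late-peak reward described above.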
archArtistPop = getPopularityMetric(archArtist)
#archArtistPop
ggplot(archArtistPop, aes(x = ReleaseDate, y = pop1)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Pop1 of Maroon 5 Songs By Release Date")
ggplot(archArtistPop, aes(x = ReleaseDate, y = pop2)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Pop2 of Maroon 5 Songs By Release Date")
ggplot(archArtistPop, aes(x = ReleaseDate, y = pop3)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Pop3 of Maroon 5 Songs By Release Date")
ggplot(archArtistPop, aes(x = ReleaseDate, y = pop4)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Pop4 of Maroon 5 Songs By Release Date")
write_csv(archArtistPop, "maroon5_pop.csv")
Every popularity metric indicates a slight increase in the popularity of Maroon 5’s charting songs over time. For pop1 and pop2, a high-leverage point likely has some influence on the exact slope of the best-fit line.
To gauge how much outside influence went into the creation of a song, the number of credited writers who are not the artist themselves (here, not members of the band) will be counted.
maroon5Members= c("Adam Levine", "Jesse Carmichael", "Mickey Madden", "James Valentine", "Matt Flynn", "PJ Morton", "Sam Farrar", "Ryan Dusick")
archArtistInfluence = getOutsideInfluenceScore(archArtist, maroon5Members)
#archArtistInfluence
ggplot(archArtistInfluence, aes(x = ReleaseDate, y = nonBandMemberWriters)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Number of non-Band Member Writers on Maroon 5's Songs Over Time")
It is apparent that all of Maroon 5’s Billboard Hot 100 charting songs were written solely by members of Maroon 5 until just before 2015. After this point, all of the charting songs have multiple writing credits given to writers outside the band. This indicates increased outside influence on the band’s later music.
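The counting step behind this metric can be sketched as follows. This is only an illustration, not the actual getOutsideInfluenceScore from frostFunctions.R. It assumes that writing credits are stored as one comma-separated string per song, and the function name count_outside_writers and the example names are hypothetical.

```r
# Count credited writers who are not in the supplied member list.
# Assumes one comma-separated string of writer names per song.
count_outside_writers <- function(writers, members) {
  vapply(strsplit(writers, ",", fixed = TRUE), function(w) {
    w <- trimws(w)
    w <- w[!is.na(w) & nzchar(w)]             # drop missing or empty credits
    sum(!(tolower(w) %in% tolower(members)))  # case-insensitive comparison
  }, integer(1))
}

# Hypothetical band and credits, for illustration only.
members <- c("Alex Stone", "Jamie Reed")
credits <- c("Alex Stone, Jamie Reed",
             "Alex Stone, Pat Doe, Sam Roe")
outside <- count_outside_writers(credits, members)
print(outside)
```

Matching on exact names is fragile when credits use alternate spellings or stage names, so a real implementation would likely need some name normalization first.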
Some further preprocessing will be done to tidy the lyric data. Then the total number of words and the number of unique words will be counted, and the number of unique words divided by the total number of words will be used as a measure of lyrical repetition in the song (a lower ratio means more repetition). Furthermore, the average word length in the song will be recorded, as well as the number of words divided by the number of seconds in the song, giving the words per second. Repetition, measured by this unique-to-total word ratio, will be considered most important and thus weighed most heavily. The average length of words in the song will be considered second most important and weighed just below the measure of repetition, and the average number of syllables per word as well as the number of words per second will be weighed the lightest.
lyricalComplexDf = getLyricalComplexity(archArtist)
## Joining, by = "word"
#lyricalComplexDf
ggplot(lyricalComplexDf, aes(x = ReleaseDate, y = lyricalComplexity)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Lyrical Complexity of Maroon 5 Songs by Release Date")
When the lyrical complexity of Maroon 5’s Billboard charting songs is plotted against time, it is apparent that there is a negative association between lyrical complexity and release date. By far the least lyrically complex song is one of the more recent ones, released around 2017. While this is worth noting, it should also be said that this point and a point around 2002 with a very high score are both arguably high-leverage points.
Previously, the music data was held for each section of each song, but it needs to be aggregated to the song level. For each song, the following will now be recorded: the number of unique chords, non-diatonic chords, and extended chords, the number of sections, and the number of section ends that are different. Note that not all songs have chord data, so those songs will receive 0 for musical complexity after the available complexity scores are standardized. Since 0 is the mean of a standardized score, the missing data will not shift a song’s total complexity, which will be calculated later from the standardized lyrical and musical complexities.
The musical complexity score weighs the number of non-diatonic chords (chords outside the key of the song, which are not expected to be heard) and the number of unique chords in the song more heavily than the number of extended chords and the number of sections that are different, as the latter are arguably less definitive measures of musical complexity.
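The lyrical measures described earlier can be illustrated with a small base R sketch. It computes only the raw components (unique-to-total word ratio, average word length, words per second) for a single lyric string. It leaves out stop-word handling, syllable counting, and the weighting, since the exact weights used in getLyricalComplexity are not stated here. The function name lyric_features and the example lyric are hypothetical.

```r
# Raw lyrical features for one song; a sketch, not the report's implementation.
lyric_features <- function(lyrics, duration_secs) {
  # Split on anything that is not a letter or apostrophe, then lowercase.
  words <- tolower(unlist(strsplit(lyrics, "[^A-Za-z']+")))
  words <- words[nzchar(words)]
  c(
    uniqueRatio    = length(unique(words)) / length(words),
    avgWordLength  = mean(nchar(words)),
    wordsPerSecond = length(words) / duration_secs
  )
}

# Made-up lyric and duration, for illustration only.
feats <- lyric_features("la la la love you", duration_secs = 10)
print(feats)
```

A heavily repeated hook pulls uniqueRatio down, which is why a lower value of this ratio corresponds to more repetition.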
musicComplexDf = getMusicComplexity(archArtist)
#musicComplexDf
ggplot(musicComplexDf, aes(x = ReleaseDate, y = musicalComplexity)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Musical Complexity of Maroon 5 Songs By Release Date")
There is less chord data to go on, but what is there shows Maroon 5 scoring lower on musical complexity in their later songs than in their earlier material.
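To make the aggregation and the handling of missing chord data concrete, here is a minimal base R sketch covering just one of the components, the number of unique chords. It assumes section-level data with one space-separated chord string per section. The column names, the toy chords, and the helper name standardize_with_missing are assumptions and are not taken from getMusicComplexity.

```r
# Hypothetical section-level chord data; "Song C" has no chord data at all.
sections <- data.frame(
  Name   = c("Song A", "Song A", "Song B"),
  chords = c("C G Am F", "C G F", "Dm G C"),
  stringsAsFactors = FALSE
)
all_songs <- c("Song A", "Song B", "Song C")

# Aggregate sections to songs: number of unique chords per song.
unique_chords <- sapply(split(sections$chords, sections$Name), function(ch) {
  length(unique(unlist(strsplit(ch, " ", fixed = TRUE))))
})

# Standardize the available values, then give songs without data the mean (0).
standardize_with_missing <- function(x) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  z[is.na(z)] <- 0
  z
}

raw <- unname(unique_chords[all_songs])  # NA for songs with no chord data
z <- standardize_with_missing(raw)
print(data.frame(Name = all_songs, uniqueChords = raw, standardized = z))
```

Assigning 0 after standardizing, rather than before, keeps songs without chord data from distorting the mean and spread of the songs that do have it.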
Now all of the smaller metric datasets will be joined and all of the columns other than the count of writers who are not in the band will be standardized.
artistMetricDf = fullMetricsDataSet(archArtistPop, archArtistInfluence, lyricalComplexDf, musicComplexDf)
#artistMetricDf
#artistMetricDf %>%
# select(Name, pop1, pop2, pop3, pop4) %>%
# gather(key = "Metric", value = "Score", -Name) %>%
# ggplot(aes(x = Name, y = Score, fill = Metric)) + geom_col(position = "dodge") + labs(title = "Comparison of Popularity Metrics Across Maroon 5 Billboard Hot 100 Songs") + theme(axis.text.x = element_text(angle = 90))
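A minimal sketch of the join-and-standardize step, assuming each metric table has one row per song keyed by Name. The toy values are made up, and this is not the body of fullMetricsDataSet. Treating totalComplexity as the sum of the standardized lyrical and musical scores is an inference from the complexity table shown later in this report, not something confirmed by the function's source.

```r
# Made-up per-song metric tables, one row per song.
popDf  <- data.frame(Name = c("A", "B", "C"), pop4 = c(1, 2, 3))
inflDf <- data.frame(Name = c("A", "B", "C"), nonBandMemberWriters = c(0, 0, 4))
lyrDf  <- data.frame(Name = c("A", "B", "C"), lyricalComplexity = c(5, 3, 1))
musDf  <- data.frame(Name = c("A", "B", "C"), musicalComplexity = c(2, 1, 0))

# Join all tables on the song name.
joined <- Reduce(function(x, y) merge(x, y, by = "Name", all = TRUE),
                 list(popDf, inflDf, lyrDf, musDf))

# Standardize every numeric column except the raw writer count.
numeric_cols <- names(joined)[sapply(joined, is.numeric)]
for (col in setdiff(numeric_cols, "nonBandMemberWriters")) {
  joined[[col]] <- as.numeric(scale(joined[[col]]))
}

# Assumed combination: total complexity as the sum of the two standardized scores.
joined$totalComplexity <- joined$lyricalComplexity + joined$musicalComplexity
print(joined)
```

Leaving the writer count unstandardized keeps it readable as an actual number of people, which is how it is discussed in the text.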
Now that all of the metric data has been collected along with the original data on the artist’s songs, the tracks can be compared directly to each other in some meaningful ways. First, a function will be made to compare chosen tracks of a particular artist. The result is a few plots tracking each song’s life on the Billboard Hot 100, and a plot giving each track’s contribution to the pop1 metric each week. Then there are some formatted and color-coded tables summarizing the metric data. Standardized values shaded red are at or below the average (0), and those shaded green are above it. This is run on a selection of Maroon 5 songs below.
#Can track pop1 metric over time because weeks is a changing metric
#Join all of the originality and complexity metrics because they are attached to the song, not moving by week
tables = compareTracks(c("She will be loved", "Girls like you"), archArtist, artistMetricDf)
#tables = compareTracks(c("She will be loved", "Harder to Breathe", "Wait", "Sugar"), archArtist, artistMetricDf)
kable(tables[1]) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
as.data.frame(tables[2]) %>%
  mutate(pop1 = cell_spec(pop1, "html", color = ifelse(pop1 > 0, "green", "red")),
         pop2 = cell_spec(pop2, "html", color = ifelse(pop2 > 0, "green", "red")),
         pop3 = cell_spec(pop3, "html", color = ifelse(pop3 > 0, "green", "red")),
         pop4 = cell_spec(pop4, "html", color = ifelse(pop4 > 0, "green", "red"))) %>%
  kable("html", escape = FALSE) %>%
  kable_styling()
| Name | ReleaseDate | pop1 | pop2 | pop3 | pop4 | GrammyAward | RiaaStatus |
|---|---|---|---|---|---|---|---|
| Girls Like you | 2018-05-30 | 3.04448437922342 | 2.91601413516202 | 0.713434951027756 | 0.944693537168094 | NA | 1x Platinum |
| She will be Loved | 2002-06-25 | 0.146558886941114 | 0.0991218159536854 | 0.635912876919234 | 0.80821799187585 | NA | 4x Multi-Platinum |
as.data.frame(tables[3]) %>%
  mutate(totalComplexity = cell_spec(totalComplexity, "html", color = ifelse(totalComplexity > 0, "green", "red"))) %>%
  kable("html", escape = FALSE) %>%
  kable_styling()
| Name | ReleaseDate | lyricalComplexity | musicalComplexity | totalComplexity |
|---|---|---|---|---|
| She will be Loved | 2002-06-25 | 0.1706889 | 0.2172540 | 0.387942887115398 |
| Girls Like you | 2018-05-30 | -0.4556603 | -0.4554724 | -0.911132657374917 |
| Girls Like you | 2018-05-30 | -0.5583044 | -0.4554724 | -1.01377675470988 |
In comparing “She will be Loved”, an older and very popular charting song of Maroon 5’s, to “Girls Like you”, a newer song indicated as their most commercially popular, it is apparent that small differences in how the tracks performed on the charts produced large differences in their contributions to the pop1 metric. “Girls Like you” was on the chart for about 10 weeks longer, which is certainly an indication of greater popularity and was rewarded with continued additions to the pop1 score, but it also held its peak position at number 1 for much longer. “She will be Loved” climbed to its peak and fell out of the top 10 in the same number of weeks that “Girls Like you” held the top spot. This staying power was rewarded with larger and larger pop1 additions each week, and is really the reason why “Girls Like you” received such a high value in that metric.
In addition, the tables indicate that “Girls Like you” had 6 outside writers whereas “She will be Loved” had none. Both songs had above-average popularity for Maroon 5 songs, but “Girls Like you” scored higher across the metrics. Notably, “She will be Loved” had above-average complexity, while “Girls Like you” was about 1 standard deviation below the average, with only a minor difference for the version with a rap verse, which is slightly more complex.
All of the functionality created above should be applicable to any artist for which valid data is available. The full pipeline of function calls is below. The input requires only the artist name, the names of the credited writers who represent the artist or group, the valid albums to be considered, and any charting songs to exclude.
#Need to pass artist, and valid albs
completeArchDf("Maroon 5", c("Adam Levine", "Jesse Carmichael", "Mickey Madden", "James Valentine", "Matt Flynn", "PJ Morton", "Sam Farrar", "Ryan Dusick"), c("Red Pill Blues + (Deluxe)", "v (Deluxe)", "Overexposed Track by Track", "Hands all over (Deluxe)", "it Won't be Soon Before Long.", "Songs About Jane"), c()) #May be 2 versions of girls like you - one with rap and one without
## Joining, by = "word"
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
From the simple linear regressions of each of the three metrics against the release date of all of Maroon 5’s charting songs, it seems there was a moderate decrease in their song complexity, a slight increase in their song popularity, and a drastic increase in outside influence on their music as time passed.
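The trend readings in these summaries come from the fitted lines in the plots. A minimal sketch of extracting such a slope directly with lm is given below. The dates and scores are made up for illustration and are not Maroon 5's values, and the column names are assumptions.

```r
# Made-up standardized complexity scores by release date (illustrative only).
songs <- data.frame(
  ReleaseDate     = as.Date(c("2000-01-01", "2005-01-01", "2010-01-01", "2015-01-01")),
  totalComplexity = c(0.9, 0.3, -0.2, -1.0)
)

# Express time in years since the first release so the slope is "per year".
songs$years <- as.numeric(songs$ReleaseDate - min(songs$ReleaseDate)) / 365.25

# Simple linear regression of the metric on time.
fit <- lm(totalComplexity ~ years, data = songs)
slope <- unname(coef(fit)["years"])
print(slope)  # negative here, since the toy scores fall over time
```

Calling summary(fit) would additionally give a standard error and p-value for the slope, which helps separate a real trend from one driven by a single high-leverage song.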
completeArchDf("Justin Timberlake", c("Justin Timberlake"), c("Justified", "Man of the Woods", "The 20/20 Experience - 2 of 2 (Deluxe)", "The 20/20 Experience (Deluxe Version)", "Futuresex/Lovesounds Deluxe Edition"), c())
## Joining, by = "word"
From the simple linear regressions of each of the three metrics against the release date of all of Justin Timberlake’s charting songs, it seems there was an insignificant change in his song complexity, a very slight decrease in his song popularity, and a slight increase in outside influence on his music as time passed.
completeArchDf("Twenty One Pilots", c("Tyler Joseph", "Josh Dun", "Nick Thomas", "Chris Salih"), c("Trench", "Blurryface","Vessel (with Bonus Tracks)", "Twenty One Pilots"), c("Cancer"))#Cancer was a cover so it is excluded even though it made the chart
## Joining, by = "word"
From the simple linear regressions of each of the three metrics against the release date of all of Twenty One Pilots’ charting songs, it seems there was a very slight decrease in their song complexity, a slight decrease in their song popularity, and no change in outside influence on their music as time passed.
completeArchDf("Foo Fighters", c("Dave Grohl", "Nate Mendel", "Pat Smear", "Taylor Hawkins", "Chris Shiflett", "Rami Jaffee", "William Goldsmith", "Franz Stahl"), c("Wasting Light", "Echoes, Silence, Patience & Grace", "In your Honor", "One by One (Expanded Edition)", "There is Nothing Left to Lose", "The Colour and the Shape", "Concrete and Gold", "Foo Fighters", "Sonic Highways"), c())
## Joining, by = "word"
From the simple linear regressions of each of the three metrics against the release date of all of the Foo Fighters’ charting songs, it seems there was a significant increase in their song complexity, a significant decrease in their song popularity, and no change in outside influence on their music as time passed.
completeArchDf("Taylor Swift", c("Taylor Swift"), c("Reputation", "1989 (Deluxe Edition)", "Red (Deluxe Edition)", "Speak Now (Deluxe Edition)", "Fearless (Platinum Edition)", "Taylor Swift"), c())
## Joining, by = "word"
From the simple linear regressions of each of the three metrics against the release date of all of Taylor Swift’s charting songs, it seems there was a slight decrease in her song complexity, a very slight increase in her song popularity, and a significant increase in outside influence on her music as time passed.
completeArchDf("Justin Bieber", c("Justin Bieber"), c("Purpose (Deluxe)", "Journals", "Believe (Deluxe Edition)", "under the Mistletoe (Deluxe Edition)", "My World 2.0", "Never Say Never - The Remixes", "My World"), c())
## Joining, by = "word"
From the simple linear regressions of each of the three metrics against the release date of all of Justin Bieber’s charting songs, it seems there was a very slight increase in his song complexity, a slight increase in his song popularity, and an insignificant change in outside influence on his music as time passed.
completeArchDf("Britney Spears", c("Britney Spears"), c("Britney Jean (Deluxe Version)", "Femme Fatale (Deluxe Version)", "Circus (Deluxe Version)", "Blackout", "In the Zone", "Britney (Digital Deluxe Version)", "Oops!... i Did it Again", "...baby One more Time (Digital Deluxe Version)", "Glory (Deluxe Version)") ,c())
## Joining, by = "word"
From the simple linear regressions of each of the three metrics against the release date of all of Britney Spears’s charting songs, it seems there was a moderate increase in her song complexity, a very slight increase in her song popularity, and a moderate increase in outside influence on her music as time passed.
completeArchDf("j Cole", c("j Cole"), c("Revenge of the Dreamers Iii", "Kod", "2014 Forest Hills Drive", "Cole World: The Sideline Story", "4 your Eyez Only", "Born Sinner", "The Blow Up"), c())
## Joining, by = "word"
From the simple linear regressions of each of the three metrics against the release date of all of J Cole’s charting songs, it seems there was a moderate increase in his song complexity, a moderate decrease in his song popularity, and a decrease in outside influence on his music as time passed.